Module 03
Variable names (those created by <- and those created by mutate()) should use only lowercase letters, numbers, and _.
Use _ to separate words within a name.
Tip
Use “long, descriptive names that are easy to understand rather than concise names that are fast to type.”
Put spaces on either side of mathematical operators apart from ^ (i.e. +, -, ==, <, …), and around the assignment operator (<-).
Note
The Python code style guide, PEP 8, has similar recommendations for spaces; however, it recommends not using spaces around the = sign when it indicates a keyword argument or a default parameter value.
Using line returns (after the comma) to separate arguments in a function call is a good practice.
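A minimal sketch of the spacing and line-break advice above. The object names and values (side_a, side_b, heights) are made up for illustration and are not from the course data.

```r
# Hypothetical values, chosen only to make the example runnable
side_a <- 1
side_b <- 2

# Strive for: spaces around +, / and <-, but none around ^
half_square <- (side_a + side_b)^2 / 2

# Avoid: same result, much harder to read
half_square_cramped<-(side_a+side_b) ^ 2/2

# Line returns after each comma keep the arguments easy to scan
heights <- c(1.62, 1.75, NA, 1.80)
mean_height <- mean(
  heights,
  na.rm = TRUE
)
```

Both versions of the calculation give the same value; only the readability differs.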
|> should always have a space before it and should typically be the last thing on a line.
If the function you’re piping into has named arguments (like mutate() or summarize()), put each argument on a new line. If the function doesn’t have named arguments (like select() or filter()), keep everything on one line unless it doesn’t fit, in which case you should put each argument on its own line.
After the first step of the pipeline, indent each line by two spaces. RStudio will automatically put the spaces in for you after a line break following a |> . If you’re putting each argument on its own line, indent by an extra two spaces. Make sure ) is on its own line, and un-indented to match the horizontal position of the function name.
# Strive for
flights |>
  group_by(tailnum) |>
  summarize(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
# Avoid
flights|>
  group_by(tailnum) |>
  summarize(
             delay = mean(arr_delay, na.rm = TRUE),
             n = n()
           )
# Avoid
flights|>
  group_by(tailnum) |>
  summarize(
  delay = mean(arr_delay, na.rm = TRUE),
  n = n()
  )
It’s OK to shirk some of these rules if your pipeline fits easily on one line. But in our collective experience, it’s common for short snippets to grow longer, so you’ll usually save time in the long run by starting with all the vertical space you need.
The same basic rules that apply to the pipe also apply to ggplot2; just treat + the same way as |>. - R4DS (2e)
Because I work almost exclusively in markdown, headers serve the same function as sectioning comments in R scripts.
Tip
Demonstrate the use of named code chunks (blocks) in Quarto.
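A possible starting point for that demonstration. In a .qmd file the chunk is opened with a {r} code fence, and the chunk is named with a #| label: option on its first line. The label summarize-delays and the delays vector are hypothetical. Because #| lines are ordinary R comments, the body below also runs as plain R.

```r
#| label: summarize-delays
#| echo: true

# Hypothetical arrival delays in minutes, with one missing value
delays <- c(5, 12, NA, 30)

mean_delay <- mean(delays, na.rm = TRUE)
mean_delay
```

Named chunks show up in the RStudio document outline and in render error messages, which makes a long document easier to navigate and debug.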
Module 3 Exercise 1
flights|>filter(dest=="IAH")|>group_by(year,month,day)|>summarize(n=n(),
delay=mean(arr_delay,na.rm=TRUE))|>filter(n>10)
flights|>filter(carrier=="UA",dest%in%c("IAH","HOU"),sched_dep_time>
0900,sched_arr_time<2000)|>group_by(flight)|>summarize(delay=mean(
arr_delay,na.rm=TRUE),cancelled=sum(is.na(arr_delay)),n=n())|>filter(n>10)
Happy families are all alike; every unhappy family is unhappy in its own way.
— Leo Tolstoy
Tidy datasets are all alike, but every messy dataset is messy in its own way.
— Hadley Wickham
Tidy or not tidy? (How would you do a statistical analysis?)
Question: If tidy data is so good, why do we encounter so many untidy datasets?
Note
We will use the pivot_wider() and pivot_longer() functions to tidy data; however, you need to be familiar with past functions that have been used to “reshape” data as they are commonly found in the wild.
cast(), melt()
cast(), melt()
spread() (superseded), gather() (superseded)
pivot_wider(), pivot_longer()
For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.
- Pivoting
pivot_longer()
Sometimes you need a human-readable table!
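A small sketch of both directions, assuming the tidyr and tibble packages are installed. The quiz scores are made-up data used only to show the shape change: pivot_longer() produces the tidy form for analysis, and pivot_wider() turns it back into a table that is easier for a person to read.

```r
library(tidyr)
library(tibble)

# Hypothetical wide table: one row per student, one column per quiz
wide_scores <- tibble(
  student = c("a", "b"),
  quiz_1 = c(8, 6),
  quiz_2 = c(9, 7)
)

# Longer: one row per student-quiz combination (tidy)
long_scores <- wide_scores |>
  pivot_longer(
    cols = c(quiz_1, quiz_2),
    names_to = "quiz",
    values_to = "score"
  )

# Wider: back to a human-readable layout
readable_scores <- long_scores |>
  pivot_wider(
    names_from = quiz,
    values_from = score
  )
```

Here long_scores has one score per row, while readable_scores has the same columns as wide_scores.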
When to use script:
When you start a new project, create a new project folder and work from that folder.
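One way to set up such a folder from R, sketched with base functions only. The project name my_project is hypothetical, and tempdir() is used so the example does not touch real files. The subfolder names follow the datasets, docs, and images folders in the diagram for this slide.

```r
# Hypothetical project root inside the session's temporary directory
project_dir <- file.path(tempdir(), "my_project")

# Subfolders matching the project diagram
subfolders <- c("datasets", "docs", "images")

for (folder in subfolders) {
  dir.create(
    file.path(project_dir, folder),
    recursive = TRUE,
    showWarnings = FALSE
  )
}

# Confirm what was created
list.files(project_dir)
```

In practice, creating an RStudio project in the new folder gives the same structure a stable working directory.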
digraph {
  rankdir=TB;
  node [shape=folder, style="filled", fillcolor="#ECECEC", fontsize=10, width=0.5, height=0.3, margin="0.1,0.1"];
  edge [arrowhead=none, penwidth=0.5];
  Project [label="Project Folder"];
  Data [label="datasets"];
  Scripts [label="docs"];
  Output [label="images"];
  Project -> Data;
  Project -> Scripts;
  Project -> Output;
  {rank=same; Data; Scripts; Output}
}
[1] "cassette" "wafer" "site" "linewidt" "runseq"
Number of observations = 180
Number of observations per line image = 5
Order of variables on a line image:
Anything that is ordered and could be a character string should be a factor.
read_ statement
mutate()
factor() within mutate()
machine_factor <- machine |>
  mutate(
    machine = factor(machine, ordered = FALSE,
                     levels = c(1, 2, 3),
                     labels = c("x2398", "x0023", "z1000")),
    day = factor(day, ordered = TRUE,
                 levels = c(1, 2, 3), labels = c("Mon", "Tue", "Wed")),
    time = factor(time, ordered = TRUE,
                  levels = c(1, 2), labels = c("AM", "PM"))
  )
machine_factor
Plot of machine data
machine_factor |>
  ggplot(aes(x = machine, y = diameter, color = machine)) +
  geom_point(position = position_jitter(0.1), show.legend = FALSE) +
  geom_hline(yintercept = 0.125, linetype = "dashed", color = "black") +
  geom_hline(yintercept = c(0.128, 0.122), linetype = "dashed", color = "red") +
  facet_wrap(vars(time)) +
  labs(
    title = "Review of Tool Performance",
    subtitle = "Tool z1000 performance on AM shift is underperforming\nTool x0023 needs centering adjustment",
    x = "Machine Number",
    y = "Diameter (inches)",
    caption = "Source: MACHINE.DAT"
  ) +
  scale_y_continuous(limits = c(0.12, 0.13), breaks = seq(0.12, 0.13, 0.0025)) +
  theme_linedraw()
Review the Import Dataset wizard in RStudio.
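The read_ statement route to factors mentioned earlier can also be written by hand rather than through the wizard. The sketch below assumes the readr package (2.0 or later, for I() on literal text) and uses a few made-up rows. The column names echo the machine data, but the values are not from MACHINE.DAT.

```r
library(readr)

# Hypothetical inline CSV standing in for a real file path
csv_text <- "machine,day,diameter\n1,1,0.125\n2,3,0.124\n3,2,0.126\n"

machine_example <- read_csv(
  I(csv_text),
  col_types = cols(
    machine = col_factor(levels = c("1", "2", "3")),
    day = col_factor(levels = c("1", "2", "3"), ordered = TRUE),
    diameter = col_double()
  )
)

# day arrives as an ordered factor, with no separate mutate() step
is.ordered(machine_example$day)
```

Declaring types at import keeps the conversion next to the file it applies to, while relabeling levels (for example 1 to "Mon") still needs factor() afterwards.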
Requires the readxl package.
Review “Analyzer study from XW.xlsx”
Using the base R function, list.files()
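A short sketch of list.files() picking out files by extension. The folder and file names are hypothetical and are created in tempdir() so the example is self-contained.

```r
# Hypothetical folder with a mix of file types
demo_dir <- file.path(tempdir(), "list_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("run_01.csv", "run_02.csv", "notes.txt")))

# pattern is a regular expression; full.names = TRUE returns usable paths
csv_files <- list.files(
  demo_dir,
  pattern = "\\.csv$",
  full.names = TRUE
)

basename(csv_files)
```

The resulting vector of paths can then be passed to a read_ function one file at a time.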
Take the dataframe and write to a file
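A minimal round trip, assuming the readr and tibble packages are installed. The results tibble and the file name are made up for illustration.

```r
library(readr)
library(tibble)

# Hypothetical results to save
results <- tibble(
  machine = c("x2398", "x0023"),
  diameter = c(0.125, 0.124)
)

# Write to a temporary location, then read back to check the file
out_file <- file.path(tempdir(), "results.csv")
write_csv(results, out_file)

round_trip <- read_csv(out_file, show_col_types = FALSE)
```

Note that a CSV file stores only text, so factor columns come back as character unless the types are declared again on import.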
write_csv()
write_tsv()
write.table()
write_xlsx()
tibble() (aka a data frame)
Applied Statistical Techniques